---
title: "Example: Optimizing Information Density | RAG | Kastrax Docs"
description: Example of implementing a RAG system in Kastrax to optimize information density and deduplicate data using LLM-based processing.
---

import { GithubLink } from "@/components/github-link";

# Optimizing Information Density ✅

This example demonstrates how to implement a Retrieval-Augmented Generation (RAG) system using Kastrax, OpenAI embeddings, and PGVector for vector storage.
The system uses an agent to clean the initial chunks to optimize information density and deduplicate data.

## Overview ✅

The system implements RAG using Kastrax and OpenAI, this time optimizing information density through LLM-based processing. Here's what it does:

1. Sets up a Kastrax agent with gpt-4o-mini that can handle both querying and cleaning documents
2. Creates vector query and document chunking tools for the agent to use
3. Processes the initial document:
   - Chunks text documents into smaller segments
   - Creates embeddings for the chunks
   - Stores them in a PostgreSQL vector database
4. Performs an initial query to establish baseline response quality
5. Optimizes the data:
   - Uses the agent to clean and deduplicate chunks
   - Creates new embeddings for the cleaned chunks
   - Updates the vector store with optimized data
6. Performs the same query again to demonstrate improved response quality

## Setup ✅

### Environment Setup

Make sure to set up your environment variables:

```bash filename=".env"
OPENAI_API_KEY=your_openai_api_key_here
POSTGRES_CONNECTION_STRING=your_connection_string_here
```

### Dependencies

Then, import the necessary dependencies:

```typescript copy showLineNumbers filename="index.ts"
import { openai } from '@ai-sdk/openai';
import { Kastrax } from "@kastrax/core";
import { Agent } from "@kastrax/core/agent";
import { PgVector } from "@kastrax/pg";
import { MDocument, createVectorQueryTool, createDocumentChunkerTool } from "@kastrax/rag";
import { embedMany } from "ai";
```

## Tool Creation ✅

### Vector Query Tool

Using createVectorQueryTool imported from @kastrax/rag, you can create a tool that can query the vector database.

```typescript copy showLineNumbers{8} filename="index.ts"
const vectorQueryTool = createVectorQueryTool({
  vectorStoreName: "pgVector",
  indexName: "embeddings",
  model: openai.embedding('text-embedding-3-small'),
});
```

### Document Chunker Tool

Using createDocumentChunkerTool imported from @kastrax/rag, you can create a tool that chunks the document and sends the chunks to your agent.

```typescript copy showLineNumbers{14} filename="index.ts"
const doc = MDocument.fromText(yourText);

const documentChunkerTool = createDocumentChunkerTool({
  doc,
  params: {
    strategy: "recursive",
    size: 512,
    overlap: 25,
    separator: "\n",
  },
});
```

## Agent Configuration ✅

Set up a single Kastrax agent that can handle both querying and cleaning:

```typescript copy showLineNumbers{26} filename="index.ts"
const ragAgent = new Agent({
  name: "RAG Agent",
  instructions: `You are a helpful assistant that handles both querying and cleaning documents.
    When cleaning: Process, clean, and label data, remove irrelevant information and deduplicate content while preserving key facts.
    When querying: Provide answers based on the available context. Keep your answers concise and relevant.
    
    Important: When asked to answer a question, please base your answer only on the context provided in the tool. If the context doesn't contain enough information to fully answer the question, please state that explicitly.
    `,
  model: openai('gpt-4o-mini'),
  tools: {
    vectorQueryTool,
    documentChunkerTool,
  },
});
```

## Instantiate PgVector and Kastrax ✅

Instantiate PgVector and Kastrax with the components:

```typescript copy showLineNumbers{41} filename="index.ts"
const pgVector = new PgVector(process.env.POSTGRES_CONNECTION_STRING!);

export const kastrax = new Kastrax({
  agents: { ragAgent },
  vectors: { pgVector },
});
const agent = kastrax.getAgent('ragAgent');
```

## Document Processing ✅

Chunk the initial document and create embeddings:

```typescript copy showLineNumbers{49} filename="index.ts"
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 256,
  overlap: 50,
  separator: "\n",
});

const { embeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-small'),
  values: chunks.map(chunk => chunk.text),
});

const vectorStore = kastrax.getVector("pgVector");
await vectorStore.createIndex({
  indexName: "embeddings",
  dimension: 1536,
});

await vectorStore.upsert({
  indexName: "embeddings",
  vectors: embeddings,
  metadata: chunks?.map((chunk: any) => ({ text: chunk.text })),
});
```

## Initial Query ✅

Let's try querying the raw data to establish a baseline:

```typescript copy showLineNumbers{73} filename="index.ts"
// Generate response using the original embeddings
const query = 'What are all the technologies mentioned for space exploration?';
const originalResponse = await agent.generate(query);
console.log('\nQuery:', query);
console.log('Response:', originalResponse.text);
```

## Data Optimization ✅

After seeing the initial results, we can clean the data to improve quality:

```typescript copy showLineNumbers{79} filename="index.ts"
const chunkPrompt = `Use the tool provided to clean the chunks. Make sure to filter out irrelevant information that is not space related and remove duplicates.`;

const newChunks = await agent.generate(chunkPrompt);
const updatedDoc = MDocument.fromText(newChunks.text);

const updatedChunks = await updatedDoc.chunk({
  strategy: "recursive",
  size: 256,
  overlap: 50,
  separator: "\n",
});

const { embeddings: cleanedEmbeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-small'),
  values: updatedChunks.map(chunk => chunk.text),
});

// Update the vector store with cleaned embeddings
await vectorStore.deleteIndex('embeddings');
await vectorStore.createIndex({
  indexName: "embeddings",
  dimension: 1536,
});

await vectorStore.upsert({
  indexName: "embeddings",
  vectors: cleanedEmbeddings,
  metadata: updatedChunks?.map((chunk: any) => ({ text: chunk.text })),
});
```

## Optimized Query ✅

Query the data again after cleaning to observe any differences in the response:

```typescript copy showLineNumbers{109} filename="index.ts"
// Query again with cleaned embeddings
const cleanedResponse = await agent.generate(query);
console.log('\nQuery:', query);
console.log('Response:', cleanedResponse.text);
```

<br />
<br />
<hr className="dark:border-[#404040] border-gray-300" />
<br />
<br />
<GithubLink
  link={
    "https://github.com/kastrax-ai/kastrax/blob/main/examples/basics/rag/cleanup-rag"
  }
/>
